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Abstract.  Many  important  algorithms  in  signal  and  image  processing,  speech  and  pattern  recogni¬ 
tion  or  matrix  computations  consist  of  coupled  systems  of  recurrence  equations.  Systolic  arrays  are 
regular  networks  of  tightly  coupled  simple  processors  with  limited  storage  that  provide  cost-effective 
high-throughput  implementations  of  many  such  algorithms.  While  there  are  some  mathematical 
techniques  for  finding  efficient  schedules  for  uniform  recurrence  equations,  there  is  no  general  theory 
for  more  general  systems  of  recurrence  equations.  The  first  elements  of  such  a  theory  are  presented 
in  this  paper  and  constitute  a  significant  step  towards  establishing  a  complete  methodology  that 
determines  systolic  array  implementations  for  a  very  general  class  of  coupled  systems  of  recur¬ 
rence  equations;  these  implementations  exhibit  provably  optimal  computation  time  while  satisfying 
various  user-specified  constraints. 
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1.  Introduction 


Systolic  arrays,  regular  networks  of  tightly  coupled  simple  processors  with  limited  storage 
provide  cost-effective  high-throughput  implementations  of  important  algorithms  in  a  variety  of 
areas  (signal  and  image  processing,  speech  and  pattern  recognition,  matrix  computations).  The 
simplest  of  these  algorithms  consist  of  a  single  recurrence  equation  while  most  of  them  are  fairly 
complex  systems  of  coupled  recurrence  equations.  A  recurrence  equation  specifies  the  computation 
of  a  variable  at  various  index  values  as  a  function  of  other  indexed  variables.  In  this  paper,  we 
present  the  first  steps  towards  a  mathematical  theory  for  the  systematic  implementation  of  coupled 
systems  of  recurrence  equations  on  systolic  arrays.  The  resulting  arrays  exhibit  provablv  optimal 
computation  time  while  satisfying  various  user-specified  constraints.  Our  theory  employs  several 
concepts  introduced  in  early  works  by  Karp,  Miller  and  Winograd  [3,  4], 

The  starting-point  for  the  systematic  systolic  implementation  of  recurrence  equations  was 
made  with  methods  developed  for  the  simplest  class  of  equations.  This  is  the  class  for  which  the 
difference  between  the  indices  of  a  used  variable  and  the  indices  of  the  computed  variable  within  a 
recurrence  equation  is  constant  (uniform  recurrence  equations  [7]  and  regular  iterative  algorithms 
[8]).  Recently,  extensions  have  been  attempted  to  cover  the  implementation  of  a  larger  class  of 
equations:  those  containing  variables  with  differing  numbers  of  indices  (e.g.  [6])  or  functions  with 
large  fan-ins,  encountered  in  dynamic  programming,  for  instance.  These  attempts,  however,  lack  the 
rigour  found  in  the  afore-mentioned  theories  for  uniform  recurrence  equations;  they  are  expedients 
with  a  limited  and  not  clearly  delineated  domain  of  application. 

For  more  general  systems  of  equations,  a  mathematical  model  must  be  established  that  captures 
the  main  properties  of  the  recurrence  equations  under  consideration,  and  mathematical  methods 
consistent  with  the  model  should  be  used  to  perform  a  computability  analysis  and  to  determine 
efficient  schedules.  These  recurrence  equations  are  characterised  by  differences  between  the  indices 
of  computed  and  used  variables  that  are  affine,  rather  than  constant,  functions  of  the  indices.  The 
key  uniformity  property  exploited  in  [4,  7,  8]  is  thus  lost.  Previous  works  tried  to  convert  the 
affine  equations  to  uniform  ones,  thus  attempting  to  render  the  scheduling  problem  amenable  to 
the  techniques  developed  for  uniform  recurrence  equations.  We  propose  instead  a  theory  that,  for 
arbitrary  affine  functions  of  the  indices,  parallels  at  a  fundamental  level  the  work  in  [4].  When  fully 


developed,  it  will  permit  the  automatic  determination  of  efficient  implementations  for  the  types  of 
algorithms  of  concern  in  previous  extensions  as  well  as  for  some  algorithms  (of  proven  importance 
in  signal  processing)  that  do  not  fall  into  the  categories  covered  by  previous  extensions. 

We  believe  that  the  proper  implementation  of  recurrence  equations  on  systolic  arrays  (or  any 
other  parallel  architecture,  for  that  matter)  necessitates  a  determination  of  the  computability  of 
the  equations.  The  information  gathered  during  the  verification  phase  can  then  be  profitably  used 
in  the  scheduling  and  processor  assignment  phase. 

In  Section  2  we  introduce  the  concept  of  computability  of  a  variable:  a  variable  is  not  com¬ 
putable  if  there  exists  an  index  value  P  so  that  up  depends  on  itself,  possibly  through  a  number 
of  intermediate  variables.  The  renaming  algorithm  presented  in  Sections  3  and  4  identifies  and 
extracts  (i.e.  gives  a  new  name  to)  those  "harmless"  instances  of  each  variable  that  can  never  be 
responsible  for  a  self-reference.  The  variables  associated  with  the  remaining,  potentially  harmful, 
instances  form  groups  (the  "steps'  of  the  algorithm)  in  which  they  mutually  depend  on  each  other. 
Hence,  each  variable  u  in  a  step  gives  rise  to  "cyclic"  dependences  of  the  form:  up  depends  on  uq. 
A  computability  analysis  then  determines  whether  Q  happens  to  be  equal  to  P  for  some  P. 

A  direct  application  of  the  definition  of  computability  would  lead  to  the  exhaustive  examina¬ 
tion  of  possible  self-dependences  for  all  instances  of  a  variable  u  in  a  step.  The  notion  of  dependence* 
mapping  (the  difference  between  the  indices  of  a  used  variable  and  the  ones  of  the  computed  vari¬ 
able).  introduced  in  Section  2.  allows  us  to  look  at  all  instances  of  a  variable  at  once  instead  of 
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looking  at  every  instance  in  turn.  It  is  sufficient,  as  shown  in  Sections  5  and  G.  to  consider  compo¬ 
sitions  of  dependence  mappings  around  cycles  instead  of  individual  dependence  mappings  for  the 
verification  of  computability.  Moreover,  for  affine  (or  uniform)  dependence  mappings,  computabil¬ 
ity  tests  can  be  performed  by  examining  those  properties  of  compositions  of  dependence  mappings 
around  cycles  that  are  invariant  with  respect  to  affine  (or  uniform)  similarity  transformations  - 
so  as  to  keep  the  dependence  mappings  affine  (or  uniform).  Necessary  and  sufficient  conditions  for 
the  computability  of  uniform  recurrence  equations  are  given  in  Section  6. 

In  Sections  2  to  6  the  computability  analysis  is  shown  to  necessitate  the  decomposition  of 
algorithms  into  steps.  This  same  decomposition  will  also  be  used  for  scheduling  the  computations  in 
the  algorithm.  The  scheduling  process  first  determines  independently  for  each  step  all  its  possible 
schedules  and  then  selects  through  an  optimisation  process  one  schedule  for  each  step  in  order 
to  obtain  an  efficient  schedule  and  processor  assignment  for  the  whole  algorithm.  With  regard 
to  uniform  recurrence  equations,  Section  7  introduces  the  notion  of  a  ‘time  cone’  that  concisely 
characterises  the  totality  of  schedules  for  a  step,  and  describes  how  to  obtain  a  particularly  simple 
type  of  schedule  from  the  time  cone.  Although  a  step  is  computable  it  may  not  necessarily  possess 
a  time  cone,  the  dependence  mappings  are  then  converted  to  affine  ones  in  order  to  find  an  efficient 
schedule  for  the  step.  The  concluding  section  gives  a  flavour  of  the  scheduling  of  affine  dependence 
mappings  that  we  plan  to  investigate  in  depth  as  we  further  develop  our  theory. 

2.  Verification  of  Computability 

The  algorithms  to  be  considered  consist  of  coupled  systems  of  recurrence  equations.  In  a 
recurrence  equation  the  computation  of  a  variable  at  various  index  values  is  specified  as  a  function 
of  other  indexed  variables  (by  convention,  a  variable  is  computed  only  once  for  each  index  value, 
thus  no  overwriting  is  allowed). 

As  an  example  consider  the  algorithm  that  computes  for  a  given  symmetric  Toeplitz  matrix 
both  its  LDLt  factorisation  and  the  LDLT  factorisation  of  its  inverse. 

Toeplitz  Factorisation  Algorithm: 

1  <  *  <  n.  r,.0  =  a._, 

2  <  i  <  n,  s,,o  =  a, -i 

1  <j  <n  -  l.  pj  =  + 

j  +  1  <  i  <  n,  r,  j 
j  +  2  <  .  <  n,  ?,,j 

•*i.o  =  1 

1  <  «  <  y  +  1,  =  -PjSj+2-,.j-i  +  S..J-1 

Any  scheduling  procedure  should  a  fortiori  be  able  to  determine  whether  a  coupled  system  of 
recurrence  equations  is  computable,  specifically  it  should  be  able  to  verify  that  there  is  no  instance 
where  the  computation  of  a  variable  at  some  index  value  depends  on  that  variable  at  the  same 
index  value.  A  particular  instance  would  be  that  a  variable  with  index  value  P  depends  on  another 
variable  with  index  Q  which  in  turn  depends  on  the  first  one  at  P.  thus  forming  a  cycle  through 
one  intermediate  variable.  Verifying  computability  calls  for  the  examination  of  all  possible  cycles, 
going  through  any  number  of  intermediate  variables.  The  solution  to  this  problem  represents  a 
cornerstone  in  our  scheduling  procedure. 

The  verification  process  should  exploit  the  structure  of  the  computations  and  check  that  given 
coupled  systems  of  recurrence  equations  do  not  exhibit  any  cycle  —  without  direct  examination  of 
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Figure  1:  Illustration  of  Dependence  Mapping  as  introduced 
in  Definition  2.1. 


the  variables  at  every  index  value.  In  other  words,  the  time  to  detect  a  cycle  should  be  independent 
of  the  index  range  of  the  variables.  To  this  end,  we  shall  deal  with  sets  of  index  points,  called 
domains  and  ranges,  and  mappings  that  relate  the  two.  Explicitly,  one  defines: 

Definition  2.1.  The  mapping  Q  =  DVU(P)  associated  with  an  equation  up  =  f{vq ,  . . .)  is  called 
dependence  mapping  (the  precedence  of  v  over  u  in  the  subscripts  indicates  that  v  at  point  Q  must 
be  available  before  the  computation  of  u  at  point  P  can  commence).  The  domain  of  computation 
Cvu  is  the  set  of  index  points  P  for  which  the  variable  u  is  computed  according  to  the  equation 
up  =  f{vq,  ...),  it  is  the  domain  of  Dvu.  The  set  of  points  Q  at  which  variable  v  is  needed  in 
order  to  compute  up  for  all  P  €  Cvu  is  denoted  by  Rvu  and  is  the  range  of  Dvu.  These  notions  are 
illustrated  in  Figure  1. 

The  dependence  mapping  DVU(P)  =  Q  associated  with  a  recurrence  equation  up  =  f(vq,  . . .) 
represents  the  index  value  Q  of  v  as  a  function  of  the  index  value  P  at  which  u  is  computed. 

For  the  Toeplitz  factorisation  algorithm  the  dependence  mappings  are 

Dar(t\0)=.--1,  Da,(i,0)  =  i—l 

DPr(i,j)=j,  Drr{i,j)  =  («  -  l,/-  1),  D,r(i.  j)=(i,j-  1) 

DeA'J)  =  ./'-  =  (*'  -  1 J  ~  1) 

=  UJ  -  i).  DlVV'j)  =  ( J  + 2  - »' J  -  i) 

Dr„(j)  =  (j,  j  -  1),  D,p(j)  =  (j  +  1J  -  1). 


and  the  corresponding  domains  of  computation  are 

Car  =  {(t.O)  :  1  <  .  <  n},  Caj  =  {(*,0)  :2</<r,} 

C fir  =  Crr  =  C,r  =  {(»,»  :  j  +  1  <  «  <  «.  1  <><'»-  1} 

Cp,  =  {(«./)  :  1  <  I  <  n,  1  <  j  <  n  -  1}.  Crs  =  {(i,j)  :  j  +  2  <  t  <  n,  1  <  j  <  n  -  1} 

=  {(*,  j)  :  1  <  «'  <  n,  1  <  j  <  n  -  1},  C™  =  {(i,j)  :  1  <  «  <  j  +  1,  l  <  j  <  n  -  1} 

Crfi^C,,,  =  {j  :  1  <j  <  n  -  1}. 

The  ranges  may  readily  be  determined  from  the  dependence  mappings  and  their  domains. 
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3.  Renaming 

During  the  process  of  verifying  the  computability  of  a  variable  one  realises  that  certain  instances 
of  this  variable  could  not  possibly  result  in  a  self-reference.  The  purpose  of  renaming  is  to  identify 
and  extract  for  each  variable  its  'harmless'  instances  so  that  the  remaining  instances  may  then  be 
subjected  to  a  verification  process.  This  is  accomplished  by  dividing  up  the  domains  of  computation, 
and  distinguishing  the  resulting  subdivisions  by  giving  the  associated  instances  of  the  variable  a 
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Figure  2:  Dependence  Graph  for  the  Toeplitz  Factorisation 
Algorithm. 

new  name.  .As  a  result  each  variable  may  be  identified  with  exactly  one  domain  of  computation, 
and  only  certain  variables  need  to  be  examined  for  computability. 

At  first,  a  directed  dependence  graph  [4,  5]  is  constructed  whose  nodes  correspond  to  the  vari¬ 
ables  and  whose  arcs  represent  the  dependences  between  variables.  Figure  2  shows  the  dependence 
graph  for  the  recurrence  equations  of  the  example  algorithm.  However,  this  graph  captures  only 
the  dependence  with  regard  to  variable  names  but  does  not  take  into  account  index  values.  For 
instance,  one  could  infer  from  Figure  2  that  the  composition  of  dependence  mappings 

D,r  O  Dr,  O  DiV 

•  •  ...  /  1  l 

may  lead  to  a  cycle  while  in  fact  one  cannot  perform  this  composition  since  the  range  of  D\,  and 
the  domain  of  Drs  do  not  intersect.  The  computability  analysis  would  be  greatly  simplified  if  any 
composition  of  dependence  mappings  associated  with  a  cycle,  i.e.  a  closed  path,  in  the  dependence 
graph  were  well-defined  (these  cycles  should  not  be  mistaken  for  the  cycles  encountered  earlier, 
which  are  cycles  in  a  larger  graph  where  each  node  corresponds  to  a  variable  at  a  particular  index 
point).  Thus  in  each  cycle  of  the  dependence  graph  the  range  of  the  dependence  mapping  associated 
with  an  arc  is  required  to  have  a  non-empty  intersection  with  the  domain  of  the  mapping  associated 
with  the  next  arc  in  the  cycle.  This  may  be  ensured  by  properly  assigning  variable  names  to  the 
l®ft-hand  sides  of  recurrence  equations.  Consequently,  the  first  step  in  our  scheduling  procedure 
will  perform  a  'renaming'  of  the  variables  in  the  algorithm. 

4.  Renaming  Procedure 

1.  Construct  the  dependence  graph  of  rhe  algorithm. 

2.  Determine  the  strongly  connected  components  of  the  dependence  graph  via  an  efficient  proce¬ 
dure  such  as  the  one  in  )9]  (see  also  [4;  and  r- blocks  in  [5]).  The  dependence  graph  for  the 
example  contains  only  two  strongly  connected  components:  {r.s.p)  and  the  single  node  <i. 
Since  all  dependence  mappings  involved  in  cycles  incident  on  the  same  node  must  correspond 
to  arcs  within  the  same  strongly  connected  component  in  rhe  dependence  graph,  the  renam¬ 
ing  may  be  performed  by  examining  each  strongly  connected  component  in  turn.  Thus  the 


remaining  steps  of  the  procedure  are  applied  to  every  strongly  connected  component  in  the 
dependence  graph. 

3.  Find  all  'well-defined’  elementary  cycles  within  a  strongly  connected  component ,  where  a  cycle 
is  elementary  if  its  arcs  have  all  different  initial  endpoints  and  it  is  well-defined  if  the  dependence 
mappings  associated  with  its  arcs  can  be  composed  according  to  the  order  imposed  by  the  cycle. 
This  is  done  by  checking  for  each  elementary  cycle  within  the  strongly  connected  component 
whether  the  range  of  each  of  its  dependence  mappings  intersects  the  domain  of  the  subsequent 
dependence  mapping. 

4.  Find  those  cycles  which  can  be  iterated  arbitrarily  many  times,  in  these  ‘iterative'  cycles  the 
number  of  possible  iterations  is  unbounded  as  the  cardinality  of  the  domains  of  the  variables 
involved  in  the  cycle  is  increased.  This  is  done  by  performing  a  test  whose  description  follows 
for  each  (well-defined)  cycle. 

Let  7  be  the  cycle  under  consideration  and  let  u  be  a  node  in  7;  denote  by  D1U  :  C1U  — •  i?-,u 
the  composition  of  dependence  mappings  around  the  cycle  starting  at  u.  where  C-,u  is  the 
domain  of  the  first  dependence  mapping  in  the  cycle  7  when  starting  at  u,  and  intersects 
C^u  since  the  cycle  is  well-defined.  The  Zth  iterate  of  D-/u,  i.e.  the  composition  of  D-/u  with 
itself  repeated  l  times,  is  denoted  by  Dllu  :  C1U  —  R-jl .  The  range  of  D^u  may  be  determined 
through  the  recurrence 


=  C,U,  rMsRiu 

2<k<l,  C$  =  Rlf-VnCiu 

=  D-:u(C^). 


The  cycle  is  iterative  if  and  only  if  |C^u+1^j  may  be  made  non-zero  for  any  /  by  selecting  the 
problem  size,  hence  |C1U|.  sufficiently  large.  This  condition  is  equivalent  to  being  able  to 
make  the  ratio  \C-‘ul'>\/\C-.u\  non-zero.  The  ratio  |Cl<„+1*|/|C-.u|  characterises  the  'shrinkage'  of 
the  computation  domain  after  l  iterations  of  D-.u  and  is  just  the  product  of  the  'incremental' 
shrinkages 


(*)  -  i 


S'  — 

’  “v  1 1  - 


'(7 

I*-'-  a 


*+1), 


I  clk,} 


I:  =  1.2 . / 


Therefore  the  cycle  7  is  iterative  if  am  1  only  if  all  .  1  <  k  <  /.  may  be  made  non-zero.  Note 
that  one  need  only  construct  the  sequence  of  the  ratios  si**  for  a  single  uo<b'  u  along  *  in  order 
to  determine  if  -  is  iterative. 
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Figure  3:  Cycle  Composition  Graph  for  Variable  s  in 
Toeplitz  Factorisation  Algorithm. 


For  illustration  consider  the  cycle  7  associated  with  the  dependence  mappings  Drp 
and  Dpr ,  the  compositions  along  the  cycle  are 

D~,r(i,j)  =  Drp  o  Dfir(i,j)  =  (j .  j  —  1), 

D~lp(j)  =  Dpr  O  Drp  (j  )  —  J  —  It 


we  also  have 

and  it  is  easily  seen  that 

2(» 

n(,i  -  1)’ 


C^r=Cpr,  C\p  =  Crp, 


3(h_2(»»-2) 
— 


,(()  - 

n  -  1  -  / 

ir  ~ 

n  —  l 

,(0  _ 

n-l-l 

]1P  - 

n  —  / 

/  =  2, 3, 
/  =  1.2, 


The  cycle  is  iterative  since  s,r  (and  s^j)  is  not  identically  0  for  any  l  when  viewed  as 
a  function  of  the  domain  size  parameter  n. 


5.  Partition  the  domains  of  computation  for  each  variable.  For  each  node  u.  determine  those 
pairs  of  well-defined  elementary  cycles  incident  on  u  that  can  be  composed  with  each  other  as 
follows. 

Consider  all  (well-defined)  cycles  7 *  incident  on  u.  and  partition  the  set  UjfcC-u,ti  into  the 
subsets  C„,  induced  by  the  intersections  and  differences  of  the  domains  ClkU. 

Consider  node  s  in  the  strongly  connected  component  {r,  3.  p }.  Four  elementary  cycles 
7  *,  1  <  k  <  -1.  are  incident  on  that  node,  corresponding  to  Dll  * .  D,,,  o  Dp,  and 

D,r  o  Dr,,  respectively.  The  intersections  and  differences  of  the  domains  C\kl  induce 
the  partition  of  (Jv  ClkS  into  C',  1  U  C',j  where 

c,,  =  {(«.;)  :  1  <  t  <  y-r  1.  1  <  ;  <  n  -  1}. 

C, 2  =  { (*.  j)  :  j  4-  2  <  i  <  n.  1  <  j  <  n  —  1}. 


Construct  the  directed  graph  whose  nodes  u,  correspond  to  the  subsets  C,„.  There  is  an 
arc.  labelled  7*.  from  node  C„,  to  node  CHJ  if  there  exists  P  6  Cu,  and  Q  S  CUJ  such  that 
Q  =  D-.kU[P ):  in  other  words,  there  is  an  arc  from  C,„  to  CUJ  if  D-.klt(ClkU  O  C,„ )  n  CUJ  r  0. 


Continuing  with  the  example,  from 


C'i  i  j  —  0,  i  U  C,2 ,  C-i :  a  —  C',  i 
01,3  —  0,1  U  0, 2,  C.,,,  =  Cj2 

we  have 


Chfia  n  Chi  —  Chi, 

£b,|j(Chi)  n  Chi 

*0. 

0->,,(Chi)  n  C, 2  =  0 

C-t ,a  n  C,2  —  0,2 1 

0ii,(0, 2)  FI  0,1 

=  0, 

0-r ,  ,  (C’,2 )  0  C,2  r  0 

C-^  n Chi  =  Chi, 

0i2,(0,i)  n  Chi 

5^0. 

0-rj3(C,i)  PI  C’,2  =  0 

Ch^  n  0,2  =  0 

C- ,,,  n  C,i  =  Ch„ 

01„(0,l)  n  Chi 

=  0, 

01,3(0,1)  n  C, 2  5*  0 

Ch, ,3  FI  0,2  —  C$2, 

0a,3  (Ch2)  n  Chi 

=  0, 

01,3(0,2)  Fl  0,2  7*-  0 

C,43  fi  Chi  =  0 

C’-,,,  n  C,2  =  CV2, 

0^,3  (C’,2)  n  0,  1 

=  0, 

013,(0,2)  nc,2^0. 

The  associated  graph  is  displayed  in  Figure  3. 


In  general,  find  the  partition  of  Ufc0-y*u  induced  by  the  strongly  connected  components  of  this 
graph.  The  elements  in  the  partition  are  thus  unions  of  subsets  CUI  .  This  partition  is  called  the 
•composition-induced’  partition  of  u.  (In  the  reduced  graph  obtained  from  the  condensation  of 
each  strongly  connected  component  into  a  single  node,  the  cycles  7*  incident  on  the  same  node 
may  be  composed  in  any  order;  this  is  the  property  we  were  looking  for.) 


The  graph  in  Figure  3  has  two  strongly  connected  components,  and  s2-  The  analysis 
is  even  simpler  for  the  two  other  variables  r  and  p.  Since  all  the  C~ttr  are  identical, 
the  graph  for  r  has  only  one  node  hence  just  one  strongly  connected  component;  the 
same  is  true  for  p. 


G.  Explicitly  rename  instances  of  variables  in  the  strongly  connected  component  so  that  the  set  of 
well-defined  cycles  incident  on  the  same  variable  is  closed  under  composition,  i.e.  all  elementary 
cycles  incident  on  the  same  variable  may  be  iterated  an  arbitrarily  large  number  of  times  and 
composed  in  arbitrary  order. 

Renaming  can  be  performed  independently  for  each  variable  since  it  is  based  exclusively  on  the 
analysis  of  the  elementary  cycles  incident  on  that  variable,  independent  of  the  names  of  the 
nodes  that  these  cycles  traverse.  For  each  variable  it.  it  amounts  to  constructing  a  partition  of 
U*  C~,kH  into  nonoverlapping  subsets  and  to  giving  distinct  names  to  the  instances  of  u  whose 
indices  belong  to  distinct  subsets. 


In  the  actual  process  of  renaming  rwo  issues  have  to  Ire  taken  care  of.  Non-iterative  cycles 
•;h  should  be  'unrolled':  this  means  that  a  different  name  is  given  to  the  instances  of  u  whose 
indices  belong  to  the  subsets  -  C-'+u1.  1  <  i  <  I.  where  /  is  the  maximum  number  of 
iterations  of  cycle  7*  (i.e.  C-^u  ^  0  and  =  0).  The  composition  of  distinct  elementary 

cycles  incident  on  the  same  variable  should  be  well-defined:  this  means  that  distinct  names  are 
given  to  the  instances  of  u  according  to  the  <  aposition-induced  partition  of  u. 

Since  an  instance  up  may  depend  on  other  instances  of  >1  through  both  iterative  and  non¬ 
iterative  cycles,  the  two  sources  of  renaming  must  Ire  combined.  This  is  simply  done  by  renam¬ 
ing  according  to  the  composition-induced  partition  of  jj(.  0-..,,  as  well  as  the  partitions  of  the 


C-.ktt  associated  with  the  non-iterative  cycles  7;  into  (J(  (CT*„  -  ( 


No  well-defined  elementary  cycle  within  the  strongly  connected  component  {r.s.p} 
is  non-iterative.  Moreover  the  union  of  the  domains  associated  with  variable  s  is  the 
only  one  to  have  a  non-trivial  composition-induced  partition.  Thus  only  the  instances 
of  s  corresponding  to  the  sets  C',i  and  C\ 2  "  ill  be  given  different  names;  we  chose  to 
use  y  for  the  index  values  in  Ctl  and  to  keep  s  for  C’,2- 


7.  Update  dependence  mapping s  and  initialisations. 


To  motivate  this  kind  of  post-processing  consider  once  more  the  example.  After 
renaming  the  last  recurrence  equation  in  the  Toeplitz  factorisation  algorithm 

1  <  j  <  n  —  l,  1  <  1  <  j +  1.  3,,j  =  —  +  s,.j_i 

becomes 

1  <  J  <  t»  -  1,  \<i<j+l.  y,.j  =  +  i/i.j-i- 

Observe,  however,  that  yl  0  is  not  defined  since  the  renaming  procedure  left  the 
assignment  s(  0  —  1  unchanged.  Clearly  the  instance  sr.o  must  be  converted  to  y j.o- 


Conseriuently.  as  renaming  is  performed,  the  dependence  mappings  must  also  be  updated.  Each 
dependence  mapping  Dt.„  is  fully  characterised  by  the  mapping  'function'  and  its  domain.  If 
\j  is  one  of  the  new  names  for  u  then  the  dependence  functions  of  y  on  other  variables  are  just 
a  subset  of  the  dependence  functions  of  11  on  other  variables.  While  the  functions  remain  the 
same,  the  new  and  old  mappings  differ  in  their  domains  (hence  also  ranges).  The  according, 
simple  updating  of  domains  is  performed  at  this  point  for  all  variables  in  the  algorithm. 

For  each  renamed  variable  >j  consider  the  elementary  cycles  7 *  incident  on  that  variable.  The 
functions  D-.ky  form  a  subset  of  the  functions  D-.kU  and.  by  construction,  the  domains  of  the 
mappings  D-.ky  are  all  identical  and  equal  to.  say  Cy.  The  instances  of  the  old  variable  u 
whose  indices  belong  to  (Jr  D-.k  (Cy)  and  are  outside  tjt  C-:kU  are  renamed  y.  It  may  happen 
that  the  same  instance  of  u  is  renamed  more  than  once  but  this  does  not  pose  a  problem  as 
it  corresponds  only  to  additional  assignments  in  the  renamed  algorithm  (for  example  11  could 
be  renamed  into  y  and  z  so  that  the  assignment  t/j  0  =  1  would  be  replaced  by  yi.o  =  1  and 
-'1.0  —  1 1- 

Upon  completion  of  this  last  renaming  operation  the  renamed  algorithm  may  be  written  in  full, 
its  dependence  graph  constructed  and  strongly  connected  components  determined  for  further 
use.  The  new  dependence  graph  will  typically  have  more  strongly  connected  components  than 
the  original  dependence  graph  if  some  renaming  has  effectively  m cured.  The  portions  of  the 
algorithm  corresponding  to  the  strongly  connected  component'  in  that  graph  are  called  fie* 
'steps'  of  the  algorithm. 

To  sum  up.  renaming  ensures  -hat  the  mapping'  rnrr-'j  a  iiug  m  *he  »l,unon*.try  cy  d •-«.  within 
a  step  are  all  well-defined,  can  he  iterated  arbitrarily  many  rime'  tmi  composed  in  arhifrarv 


Figure  4:  Dependence  Graph  for  the  Toeplitz  Factorisa¬ 
tion  Algorithm  after  Renaming. 


The  new  version  of  the  Toeplitz  factorisation  algorithm  after  renaming  follows. 

Toeplitz  Factorisation  Algorithm: 

1  <  i  <  n.  r,  o  =  <t,_  [ 

•’  <  !  <  II.  S,  0  =  <t,_  i 

1  <  j  <  rt  -  1.  P;  =  •'*;  +  ]. ^-l/r,.;_] 

j  -e  1  <  i  <  ft,  r,j  =  r,_  j  ^  !  -  Pj-Vg-l 

J  +  -  <  «  <  n.  =  —pj  r,_  i,7_  i  -  3,.j_i 

i/».n  =  1 

1  <;<■'-  l.  l  <  /  <  >  -  1.  it,.  .  =  -,yy.+2-,.j-i  -r  u,.j- 1 

The  .v>.v '-eiated  .ieprnden  e  graph  i>  gi\e:t  in  Figure  I. 

5.  Affine  and  Uniform  Dependence  Mappings 

As  a  p'-ult  • *:.•'  : -'naming  pr-cess.  maj  pings  associated  with  eyries  in  a  step  are  closed  under 
composition,  an  1  v:  -.tying  the  •••-•mpur  ihiliry  of  »  st.-p  amounts  fo  an  analysis  of  compositions  of 
its  ryries.  The  hey  . d .v rv  : : :  ui  t  t.;e  an  d'.  sis  ;s  i -a.se  \  -,n  rjje  far*  that  srlf-depen  deuce  is  invariant 

under  a  change  .>f  i--x:  an  instance  .•<■  dm  •■mis  on  ir-elf  if  and  only  if  i  f-,  depends  m  itself,  where 

P  —  FiP'i  and  F  >  j.jje, Thu-  crujiputai-uty  is  invariant  with  i-'perr  to  changes  of  rh**  indices 
of  the  variaides.  in  1  ;r  'liould  * i •  ■  :  e  vrihed  ;  v  ••x.iminiuc  'nlv  those  properties  ,.f  mannings 


associated  with  the  cycles  that  are  invariant  (with  respect  to  changes  of  indices). 

To  characterise  these  invariants,  consider  the  effect  of  change  of  index  transformations  applied 
to  every  variable  u 

Fu  :  Cu  -  C„ 

where  Cu  is  the  domain  of  computation  for  variable  u.  Under  these  changes  of  indices  a  dependence 
mapping  Dvn  becomes 

Dvu  =  Fv  o  Dvu  of'1. 

so  that  the  composition  of  dependence  mappings  D-lU  associated  with  a  cycle  incident  on  a  variable  u 
undergoes  a  similarity  transformation: 

D1U  =  Fuo  D-.„  of;1. 

Consequently,  computability  is  determined  from  those  properties  of  the  mappings  associated  with 
cycles  that  are  invariant  with  respect  to  similarity  transformations. 

To  carry  the  analysis  further  we  shall  now  restrict  the  study  to  affine  dependence  mappings, 
that  is.  mappings  of  the  form  Q  =  D{P).  where  D(P )  =  DP  +  d  is  an  affine  transformation.  Note 
that  a  non-linear  dependence  mapping  may  turn  out  to  be  an  affine  dependence  mapping  in  disguise 
and.  as  such,  can  be  transformed  into  an  affine  dependence  mapping. 

The  dependence  mapping  for  the  variable  u  in  the  recursion 

u(, -i)  =  +  1  = /(»,).  <  €{3.3,17 . 22"  +  l}.  u3  given. 

where  the  index  t  runs  over  the  set  of  the  n  +  1  first  Fermat  numbers,  is 

£>,„((<  -  l)2  +  1)  =  i  Vi  6  C„  =  {5.  17 . 22"  +  1}. 

Although  F)uu  is  not  an  alline  transformation,  it  can  be  expressed  in  the  form 

D.lH  =  Ffl  o  D,JU  o  F„.  where  F,t(i)  =  log2  log2(t  -  1),  =  j  -  1. 

and  D  is  an  affine  function.  Thus,  if  the  index  set  0„  is  transformed  via  Fu  then  the 
resulting  dependence  mapping  D„.,  is  affine.  Clearly,  the  transformation  Fu  is  just  a 
change  of  index  and  it.  does  not  change  the  computations  of  the  recursion. 


This  situation  should  be  recognised  when  it  occurs  since  computability  of  affine  dependences  seems 
to  be  more  easily  verified  than  that  of  arbitrary  dependences  and.  looking  farther  ahead,  advan¬ 
tages  will  also  accrue  with  regard  to  scheduling  and  processor  assignment.  Arbitrary  non-li”  'ar 
dependence  mappings,  however,  are  not  likely  to  be  convertible  to  affine  ones  through  change  of 
index  tra:  orinations  F.,  as  evidenced  by  the  FFT  algorithms.  As  already  mentioned,  all  depen¬ 
dence  mappings  encountered  from  this  point  on  will  be  affine.  For  affine  dependence  mappings, 
computability  tests  are  performed  by  examining  properties  of  cycles  that  are  invariant  with  respect 
to  affine  similarity  transformations  F. :  the  F.  should  be  affine  in  order  to  keep  the  dependence 
mappings  .tffine. 

A  uniform  dependence  mapping  is  an  affine  dependence  mapping  of  the  form  D>  P)  —  IP  — 
/.  where  I  is  th"  identity  matrix.  Thus  a  uniform  depen  fence  mapping  A  .just  i  translation 


and  the  ’dependence  vector"  Di  P) 
Recurrence  equations  for  which  ill 


n 


equ.ai 


to  ■/.  in  b 


lent  of 


•  lence  mapping--  are  unit-  rm  a 


the  point  P  in  the  domain, 
re  called  uniform  recurrence 
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Figure  5:  Dependence  Graph  for  the  Two-Dimensional  Filter 
Example. 

equations  [4].  To  test  the  computability  of  such  algorithms  one  should  consider  properties  of  cycles 
that  are  invariant  under  translations  F,  (in  order  to  keep  the  dependence  mappings  uniform). 
The  composition  of  uniform  dependence  mappings  associated  with  a  cycle  is  D~,U(P)  =  IP  +  d-,u 
where  d-,u  is  just  the  sum  of  the  individual  translation  vectors  dv u;  the  vector  d-,,,  clearly  depends 
only  on  the  cycle  *•  and  not  on  the  variable  u  at  which  the  cycle  is  started,  hence  one  can  write 
D~.U(P)  =  IP  +  d.,.  Under  index  changes  of  the  form  F.  (P)  =  IP  +  d.  a  uniform  dependence 
pping  DVU(P )  =  IP  +  dvil  becomes 


DVU[P)  =  IP  +  dvu,  where  dvu  =  dvu  +  (</,.  -  du). 

and  the  dependence  mapping  D-,„  around  a  cycle  is  left  invariant.  The  translations  (/-,  around 
the  elementary  cycles  within  the  dependence  graph  represent  the  invariants  to  be  examined  for 
computability  verification. 

6.  Computability  of  Uniform  Recurrence  Equations 

A  direct  application  of  the  definition  of  computability  would  lead  to  the  exhaustive  examination 
of  all  self-dependences  of  a  variable  u  in  a  step.  The  notion  of  dependence  mapping  allows  us  to 
look  at  all  instances  of  a  variable  at  once  instead  of  looking  at  every  instance  in  turn.  We  also 
know  from  before  that  it  suffices  to  consider  cycles  instead  of  individual  dependence  mappings  for 
the  verification  of  computability.  Moreover  the  renaming  process  ensures  that  any  composition  of 
cycles  in  a  step  is  well-defined.  Consequently,  instead  of  exhaustively  examining  all  self-dependences 
we  consider  arbitrary  compositions  of  elementary  cycles  thus  exploitating  the  property  that  the 
mappings  are  uniform. 

Arbitrary  compositions  of  elementary  cycles  */<,.  within  the  step  result  in  an  overall  translation 

a,  >  0. 

k 
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Figure  6:  Cycle  Composition  Graph  for  the  Two-Dimensional 
Filter  Example. 


where  A*  represents  the  number  of  times  cycle  7*  is  traversed.  If  a  linear  combination  with  positive 
coefficients  is  to  correspond  to  a  cycle  through  the  dependence  graph  the  subgraph  formed  by  the 
nodes  associated  with  the  non-zero  coefficients  must  constitute  a  connected  component  in  the 
graph  T  that  is  defined  as  follows:  the  nodes  of  F  are  the  elementary  cycles  in  the  step,  and  two 
nodes  are  connected  by  an  edge  if  the  corresponding  cycles  share  a  variable. 


Two-Dimensional  Filter  Example: 

In  this  example  from  [8]  the  filtered  image  wn+i,j.k  is  computed  from  the  given  image 
Wij/c  according  to  the  equations 


1  <  i  <  n, 
1  <  j <  rn.  1  <  k  <  m. 
1  <  i  <  ft.  1  <  j  <  rn,  1  <  k  <  m. 


><i00  —  =  t<i.o.  —  i  =  0 

''njfcSO.  wijk  given 

U|.j-+-1,A:H-1  =  /(ttjjfc.  Vijk‘ 
I'i-i.j.k  =  t ',jk) 

wi+l.j.k  =  h[lltjk,  l'jjk<  Wijk) 


The  corresponding  dependence  graph  is  shown  in  Figure  5  and  the  translations  in  the 
dependence  mappings  are  readily  identified  from  the  equations 


There  are  six  elementary  cycles 


7 1  —  72  —  {^vu}i  7a  —  {Dww  } 

7-4  =  {Duv,  Dvu}i  75  =  { Duw ,  u } *  7c  —  {Dwu,  D ut, ,  Dvuj} 
whose  translation  vectors  are  easily  seen  to  be 


d-,-2 

d-,> 

d s. 

d-n 

0 

1 

-1 

1 

-1 

0 

-1 

0 

0 

-1 

-1 

-1 

.  -1 

0 

0 

-1 

-1 

-1 . 

Figure  G  contains  the  graph  T  that  indicates  cycles  with  common  variables. 


A  variable  u  is  not  computable  if  the  following  three  conditions  are  satisfied: 

1.  The  equation  Ylk  —  0  has  a  non-trivial  positive  solution  {At}. 

2.  At  least  one  of  the  cycles  7*  for  which  A*  /  0  traverses  «. 

3.  The  subgraph  in  V  formed  by  the  nodes  corresponding  to  At  ^  0  is  a  connected  component 
(this  condition  ensures  that  one  can  go  from  one  cycle  to  another). 

In  general,  a  step  is  computable  if  all  its  variables  are  computable,  i.e.  only  conditions  1  and  3 
must  be  satisfied,  omitting  condition  2.  Thus  testing  computability  amounts  to  determining  all  the 
non-trivial  positive  solutions  to 

yi  ^kdllt = 0 

* 

and  examining  for  each  such  solution  the  connectedness  of  the  subgraph  of  T  that  is  obtained  by 
deleting  the  nodes  corresponding  to  A*  =  0. 


The  second  and  third  components  of  the  vectors  </,,  reveal  that  any  positive  linear  com¬ 
bination  Ylk  A  *>/-.,  =  0  must  satisfy  ,\,  =  A(  =  As  =  ,\e  -  0.  Coefficients  of  the  form 
A2  =  A3  >  0  solve  the  remaining  si  tem  A 2‘L.  ■+■  A31  —  0.  However,  the  nodes  in  T. 

-;2  and  73.  that  are  associate.)  » '■  .  ,  and  A  j.  are  not  connected.  Consequently,  the 

algorithm  is  computable. 


7.  Scheduling  of  Uniform  Recurrence  Equations 

It  is  clear  from  the  previous  sections  that  a  computability  analysis  necessitates  the  decom¬ 
position  of  algorithms  into  steps.  This  same  decomposition  will  also  be  used  for  scheduling  the 
computations  in  the  algorithm.  The  scheduling  process  first  determines  independently  for  each 
step  all  its  possible  schedules  and  then  selects  through  an  optimisation  process  one  schedule  for 
each  step  in  order  to  obtain  an  efficient  schedule  and  processor  assignment  for  the  whole  algorithm. 
In  this  section  we  only  describe  the  information  characterising  the  schedules  for  an  individual  step 
and  the  way  this  information  is  derived.  The  global  scheduling  and  processor  assignment  problem 
is  presented  and  its  solution  discussed  with  the  help  of  an  example  in  [l.  2], 

There  is  a  fair  amount  of  freedom  in  selecting  t he  indices  for  the  variables  in  the  various 
recurrence  equations.  One  can  for  instance  choose  not  to  have  any  relationship  among  the  indices 
in  each  equation. 


The  recurrence  equations  in  the  2-D  filter  example  could  be  rewritten 


1  <  i  <  n,  1  <  j  <  m,  1  <  k  <  in,  u,.J+  i,*+i  =  ty  tiy. _,»*») 

1  <  t'  <  n.  1  <  /  <  m.  I  <  k'  <  m,  iy_i.y.fc<  =  g(u, jk, 

1  <  <  n.  1  <  j"  <  m,  1  <  k"  <  m,  u.y+i,.»".fc"  =  /i(uljfc,  ty _,•*«.  uy-j-.*..) 

There  is  no  relationship  between  Pu  =  (i,j,k),  Pv  =  (i',j',lc')  and  Pu.  =  (i" ,  j" ,k"). 

It  is  also  possible  to  impose  some  relationship  among  the  indices,  for  instance  Pu  =  Pv  =  Pw  in 
the  above  example  results  in  the  original  version  of  the  algorithm.  Once  one  decides  to  relate  the 
indices  of  all  variables  in  a  step  many  choices  are  possible.  All  these  choices  may  be  obtained  by 
replacing  the  index  Pu  of  each  variable  u  in  a  step  by  F~l(P),  the  bijections  Fu  being  arbitrary 
and  the  index  P  being  the  same  for  all  variables  in  the  step. 

A  schedule  for  each  step  is  defined  through  a  suitable  set  of  such  transformations  Fu  and  a 
vector  r,  called  'time'  vector,  in  such  a  way  that  an  instance  of  a  variable  u  with  index  P„  is 
computed  at  time  rT P  where  P  =  FU(PU).  Of  course  not  all  transformations  Fu  and  vectors  r 
define  a  schedule;  they  must  also  satisfy  ttQ  <  rT P  whenever  a  variable  at  point  Q  is  required 
for  the  computation  of  a  %'ariable  at  point  P.  The  set  C  =  (JPufC,,)  may  be  viewed  as  the 
computation  domain  for  the  whole  step;  in  contrast  to  the  domain  Cu  for  a  single  variable  more 
than  one  computation  may  be  associated  with  a  point  P  in  C.  Also  observe  that  different  choices 
of  t  may  result  in  schedules  with  widely  differing  degrees  of  parallelism. 

For  a  cubical  domain  with  \C\  =  nQ  the  schedules 


result  in  respective  computation  times  qn.  n2.  n3  and  n'1  up  to  lower  order  terms. 

More  specific  schedules  can  be  defined  by  requiring  the  transformations  Fu  to  belong  to  a 
certain  class.  When  the  transformations  Fu  are  restricted  to  be  affine,  i.e.  FU(PU)  =  FuP„  +  du,  the 
schedule  for  the  uniform  recurrence  equations  is  called  affine.  A  translational  schedule  is  obtained 
if,  in  addition.  Fu  =  I  for  all  variables  u. 

We  shall  now  characterise  all  translational  schedules  for  a  step  that  consists  of  uniform  recur¬ 
rence  equations.  Application  of  the  transformations  F„  results  in  P  =  F„(P„)  =  Pu  +  d„  and  each 
dependence  mapping  Qv  —  DVU(P„)  =  Pn  +  dvu  becomes  Q  =  Pr  where  dvu  =  </,,  +  dvu  -  d„. 
The  set  of  translation  vectors  d„  and  a  time  vector  r  fully  define  a  translational  schedule  subject 
to  the  condition  that  rTQ  <  rT P  whenever  a  variable  at  point  Q  is  required  for  the  computa¬ 
tion  of  a  variable  at  point  P.  This  condition  is  easily  seen  to  be  equivalent  to  the  condition  that 
rTdvu  <  0  for  all  the  dependence  mappings  in  the  strongly  connected  component.  The  notion  of 
'time  cone'  provides  a  characterisation  of  all  translational  schedules,  this  characterisation  can  be 
used  for  the  fast  determination  of  translational  schedules  that  have  certain  desired  properties  and 
for  the  solution  of  the  global  scheduling  problem. 

For  each  elementary  cycle  7*  in  the  strongly  connected  component  let  vk  be  the  number  of 
arcs  in  the  cycle  and  d.,k  be  the  translation  vector  associated  with  the  cycle.  The  time  cone  of  the 
strongly  connected  component  is  the  set  of  vectors  r  satisfying  ~Td-.k  <  -i'kj  for  all  elementary 
cycles  7 k  in  the  component,  where  g  is  the  greatest  common  divisor  of  the  components  of  7. 


Figure  7:  Tree  T  from  the  Single-Source  Maximum-Cost  Al¬ 
gorithm  in  Theorem  7.1. 

The  following  theorem  provides  the  main  justification  for  the  above  definition. 

Theorem  7.1.  For  each  r  in  the  time  cone  one  can  determine  translation  vectors  clu  such  that  for 
every  arc  (u.  v)  in  the  strongly  connected  component  7-r</t.u  <  -g,  where  dvu  =  dvu  +  dv  —  du. 

Proof.  Let  r  be  a  vector  in  the  time  cone.  With  every  arc  (u,v)  in  the  strongly  connected  compo¬ 
nent  G  (in  particular  each  arc  in  a  set  of  multiple  arcs)  associate  the  cost  Ct,u  =  rTdvu  +  g.  Since  r 
belongs  to  the  time  cone,  the  sum  of  c,,„  along  a  cycle  is  at  most  zero. 

Select  a  starting  node  r  and  construct  the  tree  T  resulting  from  the  single-source  maximum- 
cost  algorithm  (this  is  just  a  single-source  shortest-path  algorithm  with  cost  equal  to  — c,.„  for  each 
arc  (u,  v)).  Then,  starting  from  the  root  r.  assign  to  each  node  u  an  integer  cu  such  that  for  every 
arc  (it,  v)  in  T  the  new  cost  is  c,.u  =  cvu  +  cv  -  cu  —  0  (ty  =  0).  We  claim  that  cvu  <  0  for  all  arcs 
( u .  v)  in  G. 

Observe  that  the  arcs  in  G  -  T  can  be  grouped  into  three  categories: 

•  forward  arcs  (from  a  node  to  a  descendent  in  T.  solid  line  in  Figure  7) 

•  backward  arcs  (from  a  node  to  an  ascendant  in  T,  dashed  line  in  Figure  7) 

•  cross  arcs  (the  remaining  arcs,  dotted  line  in  Figure  7). 

Consider  an  arc  (it.  e)  from  u  to  v  in  T  with  cost  cvu.  If  (u,  v)  is  a 

•  forward  arc  then,  by  construction  of  T,  its  cost  must  be  less  than  or  equal  to  the  cost  of  the 
corresponding  path  in  T ;  hence  c,.„  <  0. 

•  backward  arc  then  it  closes  a  cycle  in  G.  which  has  cost  at  most  0.  Since  the  cost  for  the  arcs 
in  T  is  0  one  must  have  c,.„  <  0. 

•  cross  arc  then  the  cost  of  the  path  from  r  to  e  must  exceed  the  sum  of  the  paths  from  r  to  u 
and  from  u  to  r.  otherwise  e  would  be  a  descendant  of  u.  Since  the  paths  in  T  from  r  to  u  and 
from  r  to  v  have  zero  cost,  one  must  have  c,.0  <  0. 

Clearly,  every  r,.u  is  a  multiple  of  g.  hence  (with  cr  =  0)  every  cu  is  a  multiple  of  g  and  the 
Diophantine  equation  for  the  components  of  </„,  rTd„  =  cu.  has  an  infinite  number  of  solutions 
which  can  be  easily  characterised.  For  any  set  of  translations  du.  each  satisfying  the  corresponding 
equation  rTdu  =  c„.  the  cost  on  every  arc  satisfies  <  0  so  that  rTd,.,t  <  -g  <  0.  Thus,  the 
translations  i/„  and  the  vector  r  define  a  valid  schedule. 

In  general,  to  determine  all  the  translations  < /„  which  define  a  valid  schedule  for  a  given  r 
within  the  time  cone,  one  would  have  to  examine  all  possible  assignments  cu  for  which  c,„  <  0  (the 
algorithm  above  constructs  just  one  such  assignment). 


For  a  vector  r  outside  the  time  cone  there  is  a  cycle  such  that  tt d-u>  >  —VkQi  hence  the  cycle 
contains  at  least  one  dependence  vector  i/t.u  with  ttcIvu  >  0.  whatever  translations  are  applied.  For 
this  particular  time  vector  there  is  thus  no  set  of  translations  du  that  defines  a  valid  schedule. 


.As  we  have  seen  in  Section  C  the  time  cone  corresponding  to  a  step  with  uniform  recurrence 
equations  may  be  empty  while  the  structure  of  the  graph  Y  indicates  computable  equations.  Thus, 
a  translational  schedule  may  not  always  exist  even  though  the  algorithm  is  computable. 

Consider  the  2-D  filter  example.  Since  =  —  d13  the  time  cone  is  empty  and  Theo¬ 
rem  7.1  implies  that  no  translational  schedule  exists  for  this  example.  Yet,  the  algorithm 
is  computable  as  was  proved  in  Section  G.  In  fact  the  longest  path  through  the  precedence 
graph  of  the  algorithm  is 
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Hence  a  lower  bound  on  the  computation  time  is  2n(m  +  1)  —  m  and  there  exist 
schedules  attaining  this  bound. 


Whenever  a  computable  step  of  uniform  recurrence  equations  does  not  possess  a  translational 
schedule,  schedules  at  the  next  level  of  complexity,  in  this  case  linear  schedules,  are  considered. 
'Good'  linear  schedules  are  obtained  from  affine  transformations  Fu  that  map  the  domain  of  each 
variable  u  onto  the  same  domain  C ,  Vu  :  FU(CU)  =  C  (after  renaming  all  variables  in  the  step 
have  domains  of  the  same  size)  thus  making  the  domain  C  for  the  whole  step  as  small  as  possible. 
Although  this  appears  to  be  a  only  a  heuristic  measure  with  respect  to  total  minimal  computation 
time,  we  plan  to  provide  a  rigorous  justification  of  this  procedure  in  another  paper. 

The  set  of  affine  transformations  that  map  the  domain  onto  itself  may  be  completely  chara- 
terised:  their  linear  parts,  for  instance,  have  all  eigenvalues  on  the  unit  circle. 

Possible  transformations  for  a  cubical  domain  with  edges  parallel  to  the  canonical  basis 
vectors  «i,  e2  and  63  could  be  a  reflection  with  respect  to  the  (er.ea)  plane  or  a  rotation 
with  respect  to  the  e\  direction.  The  linear  parts  of  these  transformations  are  respectively 
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and 

1 

1 

-1 

Since  the  transformations  Ftl  induce  similarity  transformations  on  the  compositions  of  depen¬ 
dence  mappings  D- ,kU.  the  new  D-.kU  are  still  translations  but  with  modified  translation  vectors  d~.kU. 
For  each  variable  u  and  every  elementary  cycle  7*  incident  on  u  one  has  D-.k>l  =  F„  °  D-.kU  o  F~: 
where 

DikU(Pu)  =  Pa  +  d-.k,  F „ ( Pu )  =  F„PU  +  d„.  F~l(P)  =  F'lP  -  F~lJu. 

Hence  D-;kU{P )  =  P  - f  Fud~,k  and  the  modified  translation  vector  is 

d-:k  u  —  Ft  id-:k- 

The  matrices  F„  are  selected  to  render  the  time  cone  associated  with  the  d~,k„  non-empty.  Suppose 
Au  =  {d-;k„}  is  the  set  of  all  vectors  associated  with  the  elementary  cycles  -•*  incident  on  >1.  Since 
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the  algorithm  is  computable,  the  time  cone  for  each  individual  Atl  is  non-empty  and  the  time  c one 
for  the  union  |JU  A„  can  also  be  made  non-empty  through  proper  choices  of  transformations  Fu. 

All  matrices  Fu  should  also  leave  at  least  one  common  direction  invariant:  there  exists  a 
vector  tr  such  that  Vu  :  Fua  =  a.  The  projections  of  the  ‘dependence  vectors'  DVU(P)  ~  P  along 
a  are  independent  of  P.  and  as  many  directions  a  as  possible  should  be  left  invariant  by  the 
transformations  Fu.  If,  in  addition,  the  inner  product  between  a  and  all  modified  vectors  associated 
with  the  cycles  is  negative,  Vd-(kU  :  d-,kUa  <  0.  then  there  exist  translation  vectors  du  and  time 
vectors  r  such  that  the  transformations  FU(PU)  =  F„PU  +  du  and  r  define  a  linear  schedule  for  the 
step.  Again,  the  proofs  and  the  development  of  an  algorithm  for  the  proper  choice  of  the  F.  are 
deferred  to  a  future  paper. 

The  projections  of  the  vectors  il-.k  on  the  (e-i.  e.j)-plane  define  a  non-empty  time  cone 
within  that  plane.  Thus,  we  shall  select  the  transformations  F.  so  that  they  leave  the 
directions  e2  and  63  invariant,  therefore 


The  first  element  of  each  matrix  F.  is  chosen  to  result  in  a  non-empty  time-cone  for  the 
vectors  d-,*..  A  possible  choice  would  be  1,-1  and  1  for  F„,  F,  and  Fw.  respectively,  so 
the  translation  vectors  for  the  transformations  may  he  determined  and  a  time  vector  r 
selected,  for  instance 

du  =  (0,0.0)r.  dv  =  (rt.  1, 0)r.  dw  =  (  —  2.  1.  l)r.  r  =  ( 1,  ri,  n)T . 

In  this  schedule,  the  computations  start  at  time  n  +  1  and  end  at  «  —  1  +  2n(m  +  1). 
requiring  a  total  time  of  '2n(m  +  1)  —  1. 


8.  A  View  of  Future  Work 

This  last  section  conveys  an  idea  of  how  we  will  deal  with  affine  recurrence  equations.  We 
shall  use  an  illustration  and  consider  the  scheduling  of  a  step  whose  dependence  mappings  are  more 
general  than  translations  (uniform  recurrences)  although  they  still  represent  a  very  special  class 
of  affine  mappings.  Each  composition  of  dependence  mappings  associated  with  a  cycle  D-.k„{P)  = 
D~,kUP  - f  d-,kU  is  assumed  to  satisfy  the  following  property:  there  exists  a  translation  vector  d-.t„ 
such  that  D-.k„(P)  s  D-,kUP  ih.k „  satisfies  D-kU{C )  —  C.  where  C  is  the  same  domain  for  all  the 
variables  in  the  step.  Uniform  recurrences  are  obtained  when  the  matrices  D-:k„  are  all  equal  to 

/  <7  \ 

the  identity.  The  dependence  function  Dp,(i.j)  =  (j  +  2  -  i.j  -  1)  in  the  Toeplitz  factorisation 
algorithm  provides  a  practical  example  of  a  non-uniform  recurrence  that  belongs  to  the  class  under 
consideration. 

To  perform  a  computability  analysis  on  such  algorithms  we  need  to  study  the  eigenstructure 
of  the  compositions  of  dependence  mappings  around  elementary  cycles.  First  we  note  that  a  finite 
number  of  iterations  of  D-.kll  results  in  an  identity  transformation. 


Lemma  8.1. 


i  /  >  1  :  D‘iAP)  =  F 


Proof.  Since  C  is  finite  and  D-k„  is  bijective.  D-%u  is  a  permutation  of  the  element'  of  C. 
The  set  of  permutations  on  C  forms  a  finite  group,  with  (';!  elements.  The  power-  of 
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{/ ,  D-vfcu,  D.tklt . },  form  a  subgroup  which  contains  at  most  |C|!  elements,  hence  there  exists  a 

finite  least  l  >  1  such  that  D[,kU{P )  =  P  for  all  P  G  C. 

I 

Let  q  be  such  that  C  is  a  subset  of  Zq  and  is  not  included  in  a  proper  affine  subspace  of  Zq . 
The  rest  of  the  discussion  is  presented  for  q  =  2  but  similar  structural  results  may  be  established 
for  an  arbitrary  number  q  of  indices. 

Theorem  8.1.  In  Z2  either  D-ritU  =  I  and  d.lkU  =  0,  or 


D-tkU  — 


u  —  Q  _  |  I  F-tk"  an<^  "b  I)d-Iku  —  0. 


Proof.  In  this  proof  we  simplify  the  notation  and  drop  the  subscripts.  Because  D  is  an  affine 
transformation  one  has 

D‘{P)  =  D'P  +  (D‘~l  +  . ..  +  D  +  I)d  =  P  VP  e  C 

implying 

(d‘  -  i){P  -  p')  =  o  vp,p'ec. 

The  ability  to  select  q  linearly  independent  vectors  P,  —  P[  where  Pi.  PI  €  C  implies  that  Dl  =  I. 
from  which  (Dl~l  +  . . .  +  D  +  I)d  =  0  follows  (moreover  det  Dl  —  1  so  D  is  unimodular).  When 
q  —  2  the  Schur  decomposition  of  D  can  be  written  as 

D  =  F~l  f  1  F 


0  A; 


Suppose  Ai  =£  1  and  Ao  /  1,  then 


( D‘~l  +  ...  +  D  +  I)  =  (D-  -  /)  -  0. 

hence  D‘(P)  =  P  in  contradiction  to  computability.  Consequently,  at  least  one  of  the  eigenvalues 
must  equal  one.  say  Aj  =  1  and.  since  D  is  unimodular.  Aj  =  ±1. 


\o  =  1: 


Dl  =  F~l  J  ^  F  =  I 


implies 


hence  / 1  =  0  and  D  —  I.  Thus  l  =  1  and  d  =  0. 


A2  =  —  1: 


D  =  F  1  *  ^  jr  where  F  =  *  F. 


hence 


D1  =  F~ 


i  1  o 

0  -1 


F  =  /. 


Thus  l  =  2  and  ( D  -f  /)(/  =  0. 


Figure  8:  Illustration  of  the  Proof  of  Theorem  8.2. 

In  a  given  cycle  7 t,  the  linear  part  of  the  mapping  D-.kU  starting  at  a  variable  v  is  a  product  of 
the  linear  parts  of  the  individual  dependence  mappings  in  7*.  The  linear  parts  of  the  mappings  D-,kU 
and  D-,kV  starting  at  variables  u  and  v ,  respectively,  in  the  same  cycle  7*.  are  cyclic  permutations 
of  each  other.  Thus  the  eigenvalues  of  D-,k „  depend  only  on  the  cycle  7*  [10]  and  are  independent 
of  the  starting  variable  u.  Consequently,  when  q  —  2  each  elementary  cycle  can  be  associated  with 
one  of  two  types  of  dependence  mappings:  those  that  are  just  translations  or  those  possessing  as 
eigenvalues  1  and  —  1.  Note  that  the  type  of  mapping  for  a  given  cycle  is  independent  of  its  starting 
variable. 

For  each  composition  of  dependence  mappings  D(P)  =  DP  +  cl  associated  with  an  arbitrary 
cycle  starting  at  variable  u  there  exists  a  vector  d  for  which  D(P)  =  DP  +  d  satisfies  D(C)  =  C. 
The  arguments  in  Theorem  8.1  still  apply  and  the  linear  part  D  must  either  be  the  identity  or  have 
eigenvalues  1  and  —1.  The  eigenvalues  associated  with  the  cycle  will  be  shown  to  be  only  functions 
of  the  number  of  times  each  of  its  elementary  cycles  is  traversed.  The  fact  that  the  eigenvalues  are 
independent  of  the  starting  variable  constitutes  a  non-obvious  generalisation  of  the  above  result  for 
elementary  cycles  and  represents  a  step  towards  an  efficient  verification  of  computability. 

Our  key  idea  consists  of  performing  index  changes  F.  for  the  variables  so  that  the  linear  parts 
of  all  dependence  mappings  in  the  step  share  the  same  set  of  eigenvectors.  These  transformations  F. 
are  assigned  according  the  following  simple  rule:  construct  a  spanning  tree  T  for  the  step  and  go 
through  T  assigning  Fr  =  /  for  the  root  r  and  the  other  Fu  such  that  Dvu  =  I  for  each  arc  (u,  r) 
in  the  tree.  Thus  the  changes  of  index  ensure  that  the  linear  part  of  the  dependence  mappings 
corresponding  to  the  arcs  in  the  spanning  tree  is  equal  to  the  identity.  We  shall  now  prove  that, 
after  this  conversion,  the  linear  parts  of  the  dependence  mappings  are  either  the  identity  or  else 
are  equal  to  the  same  matrix  D  with  eigenvalues  1  and  —  I. 

Lemma  8.2.  In  Z2  two  cycles  with  non-identity  linear  part  and  incident  on  the  same  node  must 
have  the  same  linear  part  D. 

Proof.  Let  the  dependence  mappings  corresponding  to  these  two  cycles  be  D\(P)  and  D2{P). 
From  the  extension  of  Theorem  8.1  to  arbitrary  cycles  we  know  that  det[Di)  =  det\Di)  ~  -1. 
The  linear  part  D2D \  of  the  composition  of  these  two  dependence  mappings  Do  o  D\  must  have 
determinant  one  hence  it  must  be  identity,  using  again  the  extension  of  Theorem  S.l  to  arbitrary 
cycles.  Therefore  Do  =  Dfl  =  D\  since  D\  —  I. 

I 

Theorem  8.2.  In  Z2  if  two  cycles  in  a  strongly  connected  component  have  non-identity  linear  part 
then  they  have  the  same  linear  part  D. 

Proof.  Let  it  and  e.  respectively,  be  nodes  traversed  by  the  two  cycles,  see  Figure  S:  u  and  r 
are  assumed  to  be  distinct  otherwise  the  previous  lemma  applies  directly.  Let  F„  and  Dr  be  the 
associated  compositions  of  dependence  mappings.  Since  they  are  part  of  a  strongly  connected 
component  u  and  r  are  also  nodes  of  another  cycle.  From  Lemma  8.2  the  linear  parts  of  the 
compositions  associated  with  this  cycle  must  be  the  same,  say  D.  independently  of  the  starting 
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node.  Then  either  D  —  I  or  D  ^  /  and  then,  by  Lemma  8.2.  D  =  Du  and  D  =  Dv.  implying 
Du  =  Dv.  If  D  =  I  then  there  is  one  path,  say  Dln..  between  u  and  v  that  is  in  the  spanning 
tree  and  D  —  DUVDVU  (note  that  both  Duv  and  Dvu  may  be  compositions  of  individual  D.). 
Since  the  change  of  index  transformations  have  been  applied,  all  arcs  on  the  path  Duv  have  been 
assigned  identity  matrices,  so  that  Duv  =  /.  hence  Dru  =  /.  Since  Dul,  =  D,,u  —  I  the  mapping 
DUDUVDVDVU  ■  Cu  —  Cu  is  DUDV,  and  since  det(DuZ)t.)  =  1  it  follows  that  DUDV  —  I.  Thus 
Du  =  Dr. 

I 

Since  the  cycles  with  non-identity  linear  parts  have  the  same  linear  part  D.  the  change  of  index 
transformations  ensure  that  the  linear  part  of  each  dependence  mapping  automatically  assumes  the 
value  D  or  I.  Let 

D  =  F~l 

If  the  change  of  index  Fu  =  F  is  applied  to  every  variable,  the  dependence  mappings  are  of  the 
form 

Dvu{P)  =  1  1  P  +  dvu,  or  DVU[P)  =  1  P  +  dvu. 

It  is  now  clear  that  the  eigenvalues  associated  with  an  arbitrary  cycle  are  only  a  function  of  the 
number  of  times  each  of  its  elementary  cycles  is  t  raversed.  Equipped  with  this  result,  we  could  now 
carry  out  a  computability  analysis  similar  to  the  one  for  uniform  recurrence  equations. 

The  above  results  are  also  of  importance  for  the  scheduling  of  such  recurrence  equations. 
In  order  to  find  schedules  that  are  as  simple  as  possible  one  could  attempt  one  further  index 
transformation  rendering  the  dependence  vectors  constant  (independent  of  P).  On  each  variable  a 
■folding  transformation'  may  be  performed  that  consists  of  mapping  one  half  of  the  domain  onto 
the  other  half  by  folding  it  along  an  axis  parallel  to  the  erdirection.  The  folding  is  accompanied 
by  a  renaming  of  the  variables  associated  with  the  folded  half  of  the  domain  so  that  a  variable  is 
computed  only  once  for  each  point  in  the  final  domain.  The  dependence  vectors  become  independent 
of  P  (except  for  P  ’close'  to  the  folding  axis).  As  in  the  case  of  uniform  recurrence  equations, 
simple  schedules  associated  with  a  folded  domaiu  may  not  always  exist  even  though  the  algorithm 
is  computable.  One  must  then  give  up  constant  dependences  and  resort  to  more  complex  affine 
schedules. 
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